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ABSTRACT 


Title  of  Thesis:  BIOLOGICAL  VERSUS  SUBSPACE  METHODS 

IN  SOUND  LOCALIZATION 

Saurabh  Dadu,  Master  of  Science,  2001 

Thesis  directed  by:  Professor  P.  S.  Krishnaprasad 

Department  of  Electrical  and  Computer  Engineering 


Sound  localization  is  determining  the  location  of  sound  sources  using  the  mea¬ 
surements  of  the  signals  received  by  an  array  of  sensors.  Humans  and  animals 
possess  the  natural  ability  of  localizing  sound.  Researchers  have  tried  to  model 
nature’s  way  of  solving  this  problem  and  have  come  up  with  different  methods 
based  on  various  neuro-physiological  studies.  Such  methods  are  called  biological 
methods.  On  the  other  hand,  there  is  another  community  of  researchers  who  has 
looked  at  this  problem  from  pure  signal  processing  point  of  view.  Among  the  more 
popular  methods  for  solving  this  problem  using  signal  processing  techniques  are 
the  subspace  methods.  In  this  thesis,  a  comparative  study  is  done  between  bio¬ 
logical  methods  and  subspace  methods.  Further,  an  attempt  has  been  made  to 
incorporate  the  notion  of  head-related  transfer  function  in  the  modeling  of  sub¬ 
space  methods.  The  implementation  of  a  biological  localization  algorithm  on  a 
DSP  board  is  also  presented. 
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Chapter  1 


Introduction 


The  sound  localization  problem  is  to  estimate  the  direction  of  sound  sources  using 
measurements  of  the  signals  received  by  an  array  of  microphones.  Sound  localiza¬ 
tion  can  be  useful  in  many  applications  such  as  robotic  hearing,  human-machine 
interface,  electronic  surveillance  and  military  applications. 

1.1  Sound  localization  in  Biology 

Above  applications  apart,  sound  localization  is  an  important  part  of  our  lives.  For 
many  species  such  as  barn  owl,  it  is  a  matter  of  survival.  The  natural  capabilities 
of  human  and  animals  to  localize  sound  has  intrigued  researchers  for  many  years. 
Numerous  studies  have  attempted  to  determine  the  processes  and  mechanisms  used 
by  humans  or  animals  to  achieve  spatial  hearing. 

One  of  the  first  steps  in  understanding  nature’s  way  of  solving  this  problem  is 
to  understand  how  information  is  processed  in  the  ear.  A  number  of  models  for 
the  ear  have  been  suggested  by  the  researchers  [2,  3,  4],  These  studies  suggest 
that  the  cochlea  effectively  extracts  the  spectral  information  from  the  sound  wave 
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impinging  on  the  ear  drums  and  converts  it  into  the  electrical  signals.  The  cochlear 
output  is  in  the  form  of  electrical  signals  at  different  neuron  points  along  the  basilar 
nremebrane  of  cochlea.  The  electrical  signals  then  travel  up  to  the  brain  for  further 
processing. 

Many  researchers  have  come  up  with  different  models  of  processing  of  electri¬ 
cal  signals  in  the  brain  for  sound  localization  to  support  the  experimental  data 
from  various  neurophysiological  studies.  All  these  different  models  agree  on  the 
fundamental  view  that  the  direction  of  the  sound  is  determined  by  two  important 
binaural  cues  -  the  interaural  time  difference  and  the  interaural  level  difference. 
These  binaural  cues  arise  from  the  differences  in  the  sound  waveforms  entering  the 
two  ears.  The  interaural  time  difference  is  the  temporal  difference  in  the  wave¬ 
forms  due  to  the  delay  in  reaching  the  ear  farther  away  from  the  sound  source.  The 
interaural  level  difference  is  the  difference  in  the  intensity  of  the  sound  reaching 
the  two  ears.  In  general,  the  ear  which  is  farther  away  from  the  source  will  receive 
a  fainter  sound  than  the  ear  which  is  relatively  closer  to  the  source  due  to  the 
attenuation  effect  of  the  head  and  surroundings.  The  phenomena  of  time  delay 
and  the  intensity  difference  can  be  integrated  into  the  notion  of  interaural  transfer 
function  which  represents  the  transfer  function  between  the  two  ears. 

It  is  generally  accepted  that  cross- correlation  based  computational  models 
for  binaural  processing  provide  excellent  qualitative  and  quantitative  accounts 
of  experimental  studies.  These  models  can  be  broadly  classified  into  two  kinds, 
namely,  the  temporal-correlation  models  and  the  spatial-correlation  models.  In 
the  temporal-correlation  models  [11,  7],  the  cochlear  outputs  from  the  two  ears  are 
cross-correlated  at  various  time  delays.  In  the  implementation  of  such  a  model,  the 
cochlear  outputs  are  passed  through  delay  lines.  The  cochlear  outputs  from  one 
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ear  are  continuously  compared  with  the  delayed  cochlear  outputs  of  the  other  ear. 
In  the  spatial-correlation  models  [1],  the  instantaneous  cochlear  outputs  obtained 
from  one  ear  are  compared  with  the  shifted  image  of  the  instantaneous  outputs 
obtained  from  the  other  ear.  Thus,  the  spatial  correlation  models  eliminate  the 
need  of  the  delay  line  required  to  save  the  past  cochlear  outputs. 

The  output  patterns  obtained  from  the  cross-correlation  operations  reflect  the 
binaural  information  which  can  be  refined  further  and  interpreted  to  determine 
the  direction  of  the  source. 


1.2  Sound  localization:  a  signal  processing  view 

A  different  community  of  researchers  from  the  classical  signal  processing  area  has 
also  been  involved  in  solving  the  localization  problem  from  a  different  perspective. 
In  the  signal  processing  community,  the  more  commonly  used  term  for  this  problem 
is  direction- of- arrival  (DO A)  estimation.  Earlier  work  in  the  field  of  DOA  esima- 
tion  has  focused  on  narrow-band  signals.  It  was  shown  that  under  the  narrowband 
assumptions,  the  DOA  problem  is  equivalent  to  a  spectral  analysis  problem.  Thus, 
the  classical  Fourier-based  methods  like  periodograms  can  be  used  to  solve  it  un¬ 
der  some  conditions  [15].  In  1979,  Schmidt  [5]  proposed  a  new  algorithm,  MUSIC 
(Multiple  Signal  Classification),  which  introduced  a  new  paradigm  for  solving  the 
problem.  Roy  and  Kailath  [6]  showed  that  the  computations  and  the  memory  re¬ 
quired  by  MUSIC  can  be  reduced  significantly  by  requiring  that  the  sensors  occur 
in  matched  pairs.  The  algorithm  is  known  as  ESPRIT  (Estimation  of  Signal  Pa¬ 
rameters  via  Rotational  Invariance  Techniques).  The  common  element  in  MUSIC 
and  ESPRIT  is  the  concept  of  signal  subspace  which  exploits  the  underlying  data 
structure  in  the  data  model  for  binaural  processing. 
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The  subspace  algorithms  also  assume  that  the  signals  are  narrow-band  and 
therefore  cannot  be  applied  directly  to  the  sound  localization  problem  due  to  the 
wide-band  nature  of  the  sound.  The  problem  in  wideband  direction-of-arrival 
(DOA)  estimation  arises  from  the  fact  that  the  signal  subspace  is  different  for  dif¬ 
ferent  frequencies.  Many  researchers  have  approached  this  problem  by  resolving 
the  sensors  outputs  into  discrete  narrowband  frequency  bins  and  then  indepen¬ 
dently  applying  one  of  the  narrowband  subspace  techniques  to  each  frequency  bin. 
The  estimates  obtained  from  processing  of  each  of  the  frequency  bins  are  then 
averaged  in  some  sense  to  obtain  the  final  estimate  of  the  DOAs.  A  brief  survey 
of  the  efforts  made  in  wideband  DOA  problem  is  given  below: 

In  [16],  a  global  search  similar  to  that  of  spectral-MUSIC  [8]  is  performed  on 
the  individual  bins  to  estimate  the  null  spectra  for  the  narrowband  components. 
The  null  spectral  plots  for  each  frequency  bin  are  then  arithmetically  or  geomet¬ 
rically  averaged  and  the  directions  of  arrival  of  the  sources  are  determined  from 
the  peaks  in  the  pseudo-spectrum  plot.  Su  and  Morf  [17]  employed  a  different  ap¬ 
proach  in  which  the  sensor  output  is  modeled  as  multidimensional  AR  or  ARMA 
process,  i.e,  having  rational  spectrum.  They  generalize  the  notions  of  signal  sub¬ 
space  and  array  manifolds  to  rational  vector  space  and  develop  rational  signal 
subspace  theory  based  on  these  concepts.  The  theory  is  applied  to  derive  the  unit 
circle  eigendecomposition  rational  subspace  (UCERSS)  algorithm  for  source  loca¬ 
tion.  In  UCERSS,  the  frequency  domain  representation  of  wideband  signals  is  not 
explicitly  used  in  the  sense  that  the  sensor  outputs  are  not  narrowband  filtered 
to  estimate  correlation  matrices  for  each  frequency  bins.  Rather,  the  correlation 
matrix  is  first  estimated  and  then  transformed  into  the  frequency  domain  using 
one  of  the  multidimensional  rational  spectrum  modeling  schemes.  The  narrowband 
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signal  subspace  processing  is  then  applied  to  discrete  points  on  unit  circle  in  the 
frequency  domain.  The  individually  obtained  estimates  are  then  combined  in  a 
similar  fashion  as  [16]. 

Su  and  Morf  [18]  proposed  another  solution  based  on  the  rational  signal  sub¬ 
space  model  known  as  modal  decomposition  signal  subspace  (MDSS)  algorithm. 
It  uses  the  fact  that  the  output  of  the  array  at  the  system  poles  is  characterized 
by  the  emitters  sharing  that  pole.  The  column  space  of  the  residue  matrices  at  the 
system  poles  spans  the  signal  subspace  corresponding  to  the  emitters  sharing  that 
pole.  By  decomposing  the  emitter  signals  in  this  manner,  more  sources  can  be  re¬ 
solved  than  the  number  of  sensors  in  the  array.  The  number  of  sources  that  can  be 
resolved  at  a  pole  is  limited  by  the  number  of  sensors.  Otterston  and  Kailath  [19] 
applied  the  ideas  in  modal  decomposition  signal  subspace  algorithm  to  ESPRIT 
that  retained  the  basic  advantages  of  ESPRIT  as  compared  to  MUSIC,  namely  the 
reduced  number  of  computations  and  that  the  knowledge  of  array  characteristics 
is  not  required. 

An  alternative  representation  of  wideband  signals  was  proposed  in  [20]  based 
on  a  low-rank  charecterization  of  the  signal  in  a  higher  dimensional  space  but  it 
requires  large  number  of  computations. 

In  1983,  Wang  and  Kaveh  [21]  demonstrated  that  it  is  possible  to  have  a  low- 
rank  model  of  the  system.  They  proved  that  there  exist  linear  transformations  that 
map  the  estimated  subspace  for  one  frequency  to  a  focussing  frequency.  The  linear 
transformations  are  known  as  focusing  matrices.  The  sensor  outputs  are  resolved 
in  narrow  frequency  bands  and  their  subspace  estimate  is  mapped  to  a  single 
focusing  frequency  by  muliplying  them  by  corresponding  focusing  matrices  giving 
a  low-rank  model  of  covariance  matrix.  A  narrowband  DOA  estimation  scheme 
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can  then  be  applied  to  this  covariance  matrix.  This  technique  is  called  coherent 
signal  subspace  processing.  The  computation  of  linear  transformations  require 
premilinary  knowledge  of  angles  which  can  be  obtained  using  low  resolution  (and 
hence  computationally  inexpensive)  methods  such  as  periodogram  or  conventional 
beamformer.  In  [22] ,  it  is  shown  that  unitary  focusing  matrices  result  in  improved 
performance.  [22]  and  [23]  describe  methods  to  compute  unitary  focusing  matrices. 

Doron  et  al.  [24]  discovered  a  separable  representation  of  the  array  manifold 
such  that  array  characteristics  (such  as  array  geometry)  and  the  frequency  of  the 
source  signals  can  be  separated  from  the  angles-of-arrival.  This  made  it  possible  to 
find  transformations  that  do  not  require  premilinary  estimates  of  the  angles.  This 
method  is  termed  as  Array  Manifold  Interpolation  (AMI).  The  separable  represen¬ 
tation  in  AMI  is  obtained  by  using  infinite  series  expansion  of  plane  waves  in  polar 
coordinates.  The  finite  series  approximation,  in  general,  requires  a  large  number 
of  sensors.  For  the  special  case  of  a  uniform  circular  array,  termed  the  Circular 
Manifold  Interpolation  (CMI),  the  AMI  method  can  be  implemented  efficiently 
using  the  FFT  algorithm. 

One  of  the  primary  objectives  of  this  thesis  is  to  compare  the  biological  models 
and  subspace  models.  Among  the  biological  algorithms,  we  considered  two  algo¬ 
rithms  covering  both  the  temporal-correlation  and  the  spatial-correlation  based 
techniques.  Among  the  subspace  algorithms,  we  have  considered  both  the  MUSIC 
and  ESPRIT-based  methods.  The  finer  details  of  the  computations  involved  in  the 
subspace  models,  however,  differ  from  the  models  described  above.  An  attempt 
has  been  made  to  incorporate  the  concept  of  interaural  transfer  function  which  is 
integral  to  the  biological  models. 

The  thesis  is  organized  as  follows.  In  Chapter  2,  the  hardware  and  software 
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setup  for  the  implementation  of  sound  localization  has  been  described.  Chapter 
3  deals  with  the  biological  models  in  sound  localization  describing  the  fundamen¬ 
tal  concepts  of  head-related  transfer  function  and  the  interaural  transfer  function. 
Lyon’s  cochlear  model  and  the  function  of  different  blocks  in  the  model  are  dis¬ 
cussed.  The  output  of  the  cochlear  model  is  applied  to  two  localization  systems, 
one  based  on  the  temporal  correlation  methods  and  the  other  based  on  the  spatial 
methods  (stereausis) .  The  various  computations  involved  and  the  implementation 
on  DSP  hardware  are  described  in  detail.  Chapter  4  focuses  on  the  subspace  meth¬ 
ods  for  sound  localization.  The  data  model  for  subspace  algorithms  is  developed 
that  describes  the  relationship  between  the  output  of  the  sensors  and  the  signals 
emitted  by  the  sources  and  their  dependence  on  various  parameters  such  as  the 
response  characterstics  of  the  HRTF  and  sensors,  and  the  location  of  the  sources 
with  respect  to  the  sensor  array.  The  MUSIC  and  ESPRIT  algorithms  are  derived 
and  methods  to  estimate  the  direction  of  sound  are  developed.  Finally,  in  Chap¬ 
ter  5,  the  results  and  the  performance  of  all  the  four  methods  are  presented  and 
discussed. 
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Chapter  2 


Implementation  Setup 


This  chapter  describes  the  set  up  for  the  real-time  implementation  of  sound  local¬ 
ization  alogrithms. 

2.1  Hardware  Components 

Figure  2.1  shows  the  major  components  in  the  physical  set  up  of  our  system.  The 
microphones  mounted  on  the  dummy  head  of  a  robot  collect  the  sound  signals. 
These  signals  are  sent  to  the  PC  on  which  the  Coreco  board  is  mounted  using 
a  wireless  LAN  setup  in  the  laboratory.  The  server  program  running  on  the  PC 
receives  the  audio  signals  and  passes  them  onto  the  Coreco  Python/C67  board  for 
processing  and  computation  of  the  direction  of  the  sound  source.  We  will  now 
describe  each  of  the  components  in  greater  detail. 

2.1.1  The  Robot 

The  robot  (Figure  2.2)  used  in  the  project  is  a  Super  Scout  II,  manufacured  by 
Nomadic  Technologies.  It  has  an  onboard  computer  powered  by  Pentium  233 


Audio 

input 


Audio  input 


Angles 

(training  phase) 


Estimated 

direction 


Figure  2.1:  Dataflow  between  major  hardware  components. 


Figure  2.2:  Super  Scout  II  Robot 

MHz  processor  which  runs  RedHat  Linux.  A  dummy  head  made  of  Styrofoam 
was  mounted  on  top  of  the  robot.  There  were  two  microphones  placed  in  the 
head  at  approximately  the  same  locations  as  the  human  ears.  The  microphones 
were  connected  to  a  sound  card  in  the  computer  system  through  analog  amplifiers 
and  filters.  The  filters  were  used  to  band-limit  the  input  signals  to  18.5  kHz. 
This  simulated  the  hearing  range  of  the  humans.  Secondly,  these  filters  acted  as 
anti-aliasing  filters  for  discrete-time  sampling  by  the  sound  card.  The  sound  card 
digitized  the  audio  signals  at  the  rate  of  40  kHz.  The  discrete-time  samples  of  the 
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audio  signals  received  from  the  two  microphones  were  multiplexed  together  in  one 
stream.  The  stream  was  then  sent  to  the  PC  using  a  TCP/IP  connection  over  a 
wireless  LAN. 

The  reasons  for  using  a  robotic  system  for  acquiring  the  sound  data  are  two¬ 
folds: 

1.  The  robot  provided  a  mobile  platform  which  was  required  for  the  online 
training  of  the  sound  localization  system  as  explained  later  in  Chapter  3. 

2.  Such  a  system  can  be  used  for  further  research  in  problems  like  human- 
machine  interface,  obstacle  avoidance  and  so  on. 

2.1.2  The  Coreco  Board 

The  Coreco  Python/C67  board  was  the  core  component  of  the  whole  system  on 
which  the  sound  localization  algorithm  was  implemented.  It  is  a  multi-DSP  board 
based  on  Texas  Instruments’  TMS320C6701  DSP  chips.  The  configuration  of  our 
board  consisted  of  four  DSP  chips  connected  via  dedicated  communication  link 
with  a  peak  bandwidth  of  400  MB/s.  The  board  provides  up  to  6400  MIPS  of  pro¬ 
cessing  power  making  it  suitable  for  intense  number-crunching  required  by  signal 
processing  algorithms.  The  system  is  designed  for  multiprocessing  applications.  In 
our  implementation  of  sound  localization  system,  we  used  all  the  four  DSPs. 

2.1.3  The  Host  Computer 

A  Pentium  PC  running  Windows  NT  was  used  as  host  to  the  Coreco  board.  It  was 
connected  to  the  Coreco  board  using  the  PCI  Bus.  The  host  computer  not  only 
provided  an  interface  to  the  Coreco  board  but  also  acted  as  a  communication  link 
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between  the  robot  and  the  Coreco  board.  The  Code  Composer  Studio  provided 
the  platform  for  the  software  development  and  debugging  environment. 


2.2  Software  Components 

Figure  2.3  shows  the  key  software  components  that  were  developed  for  the  project 
and  the  flow  of  information  between  them.  The  NETSRV  program1  running  on 
NT  machine  is  the  central  link  for  exchange  of  information  between  the  user,  the 
robot  and  the  Coreco  board. 

The  console  window  provided  by  NETSRV  is  used  to  interact  with  the  user  for 
loading  the  sound  localization  programs  onto  the  DSP  chip,  uploading  of  certain 
parameters  required  by  the  algorithms  and  setting  up  the  socket  connections  be¬ 
tween  the  data  acquisition  program  and  the  control  program  running  on  the  robot 
computer. 

The  data  acquisition  program  on  the  robot  digitizes  the  audio  signals  received 
from  the  microphones,  opens  a  TCP/IP  socket  and  waits  for  the  connection  to  be 
set  up.  Once  the  connection  is  completed  by  the  NETSRV  program,  it  continuously 
sends  out  the  audio  data  in  blocks  of  size  1024  samples. 

The  robot  control  program  is  used  only  in  the  learning  phase.  It  controls  the 
movement  of  the  robot  and  is  discussed  in  greater  detail  in  the  next  chapter. 

The  communication  between  the  Coreco  board  and  the  NETSRV  program  was 
realized  using  the  application  programmer’s  interface  (API)  software  modules  pro¬ 
vided  with  the  Coreco  system. 

1The  NETSRV  program  was  written  by  Cliff  Knoll  of  Neural  Systems  Lab 
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2.2.1  C60  Host  API 


The  Host  API  follows  a  shared  memory  model  whereas  the  DSP’s  memory  is  vis¬ 
ible  from  the  host  application.  Most  of  the  operations  are  carried  out  by  directly 
mapping  the  DSP’s  memory  onto  the  host.  These  APIs  offer  a  basic  set  of  func¬ 
tionalities  for  communication  between  the  C6701  and  the  host  program  NETSRV. 
More  complex  functionalities  were  built  using  the  simple  functions. 

2.2.2  C60  Native  API 

C6701  offers  low  level  APIs  that  are  called  C60  Native  API.  These  APIs  are  used 
for  message  passing  between  the  host  and  the  DSP,  memory  allocation,  interrupts, 
timer,  buffers  and  direct  memory  access  (DMA)  management.  These  APIs  have 
been  used  extensively  by  the  NETSRV  program  as  well  as  the  sound  localization 
program. 

The  APIs  pass  on  the  input  audio  data  to  the  sound  localization  firmware 
running  on  TI  TMS320C6701  DSP  which  computes  the  estimated  direction  of 
the  sound  source  and  relays  it  back  to  the  PC.  The  implementation  of  the  sound 
localization  firmware  is  described  in  Chapter  3. 
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Figure  2.3:  A  schematic  diagram  of  the  interaction  among  the  software  blocks. 
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Chapter  3 


Biological  Methods  for  Sound 
Localization 


The  algorithms  presented  in  this  chapter  were  inspired  by  how  humans  and  animals 
localize  sounds.  The  algorithms  use  the  cochlear  model  to  separate  the  spectral 
information  in  the  sound  wave. 


3.1  The  Cochlear  Model 

The  cochlear  model  is  an  attempt  to  model  the  mammalian  cochlea  based  on  neuro¬ 
physiological  studies.  This  model  describes  the  propagation  of  sound  in  the  inner 
ear  and  the  conversion  of  the  acoustical  energy  into  neural  representations.  Sound 
that  enters  the  outer  and  the  middle  ear  is  passed  through  the  oval  window  into  the 
cochlea.  Once  in  the  cochlear  duct,  the  pressure  wave  propagates  down  the  basilar 
membrane.  The  stiffness  of  the  basilar  membrane  varies  smoothly  over  its  length. 
Thus  a  point  in  the  basilar  membrane  is  most  resonant  to  a  particular  frequency  in 
the  pressure  wave.  The  vibrations  at  different  points  in  the  membrane  are  sensed 
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by  the  hair  cells  which  convert  the  mechanical  signals  into  electrical  signals.  These 
electrical  signals  are  then  communicated  to  higher  levels  in  the  brain. 

Since  each  point  in  the  basilar  membrane  responds  best  to  one  frequency,  it 
effectively  decomposes  the  acoustical  energy  into  different  frequency  bands.  The 
cochlea  near  its  base  (where  the  sound  enters)  is  most  sensitive  to  high  frequency 
sounds  and  as  the  wave  travels  down  the  cochlea,  it  becomes  more  sensitive  to 
lower  frequencies. 

This  frequency  dependent  response  of  cochlea  can  be  best  modeled  as  contin¬ 
uous  differential  equations.  However  for  implementation  purpose,  it  is  normally 
modeled  in  discrete  sections  as  a  bank  of  bandpass  filters,  called  cochlear  filters. 
These  filters  separate  the  input  to  the  ear  in  different  frequency  bands  or  channels. 
The  output  of  each  cochlear  filter  is  passed  through  non-linear  structures  such  as 
half-wave  rectifier  (HWR)  and  automatic  gain  controller  (AGC)  to  simulate  the 
response  of  actual  human  cochlea.  Figure  3.1  shows  the  schematic  diagram  of  the 
cochlear  model.  The  output  of  the  cochlear  model  is  a  set  of  N  signals,  where  N 
is  the  number  of  cochlear  filters. 

3.1.1  The  Cochlea  Filter  Bank 

The  cochlear  filters  can  be  emulated  by  the  gammatone  filters.  In  our  experiments, 
we  used  a  bank  of  N  =  129  cochlear  filters  with  characteristic  frequencies  spanning 
the  whole  audio  spectrum.  The  frequency  response  of  some  of  the  cochlear  filters 
are  shown  in  the  Figure  3.2.  As  we  see  from  the  figure,  the  filters  lying  in  the  same 
neighborhood  have  large  overlap  which  introduces  correlation  across  the  frequency 
channels.  This  correlation  is  exploited  by  the  spatial-correlation-based  stereausis 
algorithm  for  binaural  processing. 


15 


Audio 

Input 


Middle-Ear 

Filter 


_ 1 

i 

_ \ 

i 

f _ 

Filter 

HWR 

AGC 

Figure  3.1:  Lyon’s  cochlear  model. 
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3.1.2  Automatic  Gain  Controller 


The  cochlear  filters  are  followed  by  simple  half-wave  rectifiers.  The  output  of  a  half¬ 
wave  rectifier  models  the  non-linearity  of  the  hair  cells,  providing  a  non-negative 
output  representing  neural  responses. 

Automatic  gain  controllers  are  used  to  capture  other  non-linearities  of  the  ear 
such  as  saturation  and  masking.  A  four-stage  automatic  gain  controller  (AGC)  was 
used.  The  signals  of  each  channel  coming  out  of  the  HWR  stages,  pass  through 
these  four  AGC  stages.  The  gain  of  each  stage  depends  on  a  time  constant.  The 
different  time  constants  simulate  the  different  adaptive  times  of  our  auditory  sys¬ 
tem;  the  first  AGC  stage  has  the  biggest  time  constant  so  that  it  reacts  to  the  input 
signal  more  slowly,  while  the  following  stages  have  decreasing  time  constants.  The 
AGC  stages  of  each  channel  are  coupled  to  the  corresponding  AGC  stages  of  the 
adjacent  channels.  Thus  a  channel  can  affect  the  output  of  all  the  channels  in 
the  filter  bank  although  the  effect  will  decay  exponentially  with  distance.  Such  a 
coupling,  in  effect,  produces  masking  effects  in  the  cochlear  output.  The  outputs 
of  the  last  stage  approximately  represent  the  neural  firing  rates  produced  by  the 
transformation  of  various  parts  of  the  cochlea  due  to  the  sound  pressure  waves 
entering  the  inner  ear. 

Figure  3.3  shows  the  implementation  of  the  AGC  [25].  The  objective  of  AGC 
is  to  attenuate  the  input  signal  so  that  on  average  it  remains  below  a  target  value. 
The  loop  filter  is  a  simple  low  pass  filter  with  a  feedback  gain  of  (1  —  e)/3.  The 
time  constant  is  related  to  the  parameter  e  by  the  following  equation 

time  constant  =  1  —  exp{— Fs/e] 
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where  Fs  is  the  sampling  frequency.  A  longer  time  constant  means  that  the 
response  of  AGC  to  the  input  is  slower. 

The  states  of  the  two  adjacent  cochlear  frequency  channels  are  combined  with 
the  current  channel  and  averaged.  The  target  parameter  is  used  to  scale  the  input 
to  the  loop  filter.  In  long  run,  as  shown  below,  the  state  will  track  the  value  of  the 
output  of  the  AGC  divided  by  the  value  of  target. 

Assuming  the  state  values  of  the  adjacent  and  the  current  channels  are  equal, 
the  state  equation  can  be  written  as 

state(n)  =  +  3  •  state(n  —  1) — — —  (3.1) 

In  long  term,  for  constant  value  of  y,  the  state(n)  is  given  by 


lim  state  — *  — — — 
n^°°  target 


(3,2) 


The  output  of  AGC,  in  long  term,  is  then  given  by 

target  x 

y  7 - T7 -  (3-3) 

target  +  x 

The  values  of  time  constant  and  target  parameters  used  in  the  implementation 


were: 


AGC  stage 

Time  constant 

target 

First  AGC 

640  ms 

0.0032 

Second  AGC 

160  ms 

0.0016 

Third  AGC 

40  ms 

0.0008 

Fourth  AGC 

10  ms 

0.0004 

Table  3.1:  Paramaters  of  Automatic  Gain  Controller 


19 


Input  (x) 


state  from  left 
channel 


state  from  right 
channel 


(1-e)/3 


e/target 


(y) 


Figure  3.3:  A  schematic  diagram  of  the  automatic  gain  controller  (adapted  from 
Slaney  [25]). 
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3.2  Interaural  Transfer  Function 


Before  we  define  the  interaural  transfer  function,  we  will  describe  the  important 
concepts  that  characterize  the  interaural  transfer  function.  They  are  as  follows. 

3.2.1  Binaural  Cues 

The  differences  in  the  sound  waves  impinging  on  the  two  ears  are  known  as  binaural 
cues.  Such  differences  are  essential  to  locate  sound  sources  in  space.  Interaural 
time  difference  (ITD)  and  the  interaural  level  difference  (ILD)  are  recognized  as 
the  two  most  important  binaural  cues  for  localization. 

Interaural  Time  Difference 

The  interaural  time  difference  is  the  time  delay  between  the  signals  reaching  the 
two  ears  that  arises  because  the  separation  of  the  ears  introduces  a  path  length. 
The  time  delay  depends  on  the  separation  distance  between  the  ears,  the  angle  of 
arrival  of  the  sound  wave,  and  its  speed  of  propagation.  It  is  generally  difficult  to 
measure  the  time  delay,  per  se.  So,  usually,  the  phase  difference  in  the  two  signals 
is  used  as  a  measure  of  ITD.  For  this  reason,  ITD  is  also  know  as  interaural  phase 
difference. 

Interaural  Level  Difference 

The  interaural  level  difference  is  the  difference  in  the  intensity  of  the  signals  reach¬ 
ing  the  ears.  Sound  waves  that  come  from  different  directions  in  the  space  are 
diffracted  and  scattered  by  the  head,  shoulders,  torso,  etc.  This  causes  differences 
in  the  wave  appearing  on  the  ear  drums  and  is  the  basis  of  ILD. 
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It  may  seem  that  the  ITD  information  should  be  sufficient  for  sound  localization 
but  that  is  not  so.  ILD  complements  the  ITD  information  in  many  situations.  One 
of  them  is  in  the  case  of  higher  frequencies  (>  1500  Hz)  when  the  phase  information 
becomes  ambiguous.  This  can  be  explained  using  the  Nyquist  sampling  theorem 
and  its  equivalence  in  the  spatial  sampling  case.  Indeed,  at  a  particular  instant 
of  time,  the  sound  waveform  is  sampled  by  microphones  at  two  points  in  space. 
The  distance  between  the  microphones  is  analogous  to  a  time-sampling  period. 
If  this  distance  is  greater  than  half  the  wavelength  of  the  signal,  then  the  phase 
information  cannot  be  determined  with  certainity. 

The  ILD  information  may  also  be  useful  in  solving  the  cone  of  confusion  prob¬ 
lem.  The  cone  of  confusion  is  the  set  of  all  directions  for  which  the  time  delay  is 
same.  If  the  ILD  information  is  different  for  each  of  the  directions  on  a  cone,  it 
can  be  used  to  discriminate  and  locate  the  direction  of  the  source. 

3.2.2  Head-Related  Transfer  Function 

The  transformation  of  the  sound  wave  from  the  source  to  the  ear  is  normally 
described  by  a  transfer  function  called  the  head- related  transfer  function  (HRTF). 
The  HRTF  is  a  function  of  the  frequency  of  the  signal  and  the  location  of  the 
source  with  respect  to  the  head.  The  location  of  the  source  can  be  specified  by  its 
range,  azimuth  angle  and  the  elevation  angle.  In  this  thesis,  we  shall  be  concerned 
with  the  estimation  of  azimuth  angle  only.  The  elevation  angle  will  be  asssumed  to 
be  zero  at  all  times.  However,  the  extension  to  2-D  case  of  azimuth  and  elevation 
estimation  is  straightforward  in  most  cases.  Further,  the  source  will  be  assumed 
to  be  in  the  far  held;  thus  the  dependence  of  HRTF  on  the  range  will  be  ignored. 
To  this  end,  consider  a  sound  source  located  at  azimuth  angle  9  with  respect  to 
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the  head.  Let  S(u)  be  the  Fourier  transform  of  the  source  signal,  Hx(uj,9)  and 
Hy(uj,9)  be  the  HRTF’s  of  left-ear  and  right-ear  respectively,  then  the  Fourier 
transform  of  the  signals  received  at  the  two  ears  can  be  given  by 


X(u,d)  =  Hx(u,9)S(u) 
Y(u,9)  =  Hy(uj,  9)S(u>) 


(3.4) 

(3.5) 


Next,  we  define, 


F(u,9)  = 


Hy(u,9) 


(3.6) 


Hx(u,0) 

F(oj ,  9)  is  known  as  the  interaural  transfer  function  (ITF).  The  interaural  trans¬ 
fer  function  captures  the  important  binaural  cues.  The  interaural  time  difference 
is  captured  in  the  phase  information  of  the  ITF.  More  specifically,  the  derivative  of 
arg(F(u;,  9))  with  respect  to  u  gives  the  ITD.  Note  that  introduction  of  frequency- 
dependent  HRTF  results  in  dependence  of  the  interaural  time  (phase)  difference 
on  frequency.  On  the  other  hand,  the  magnitude  of  F(u>,  9)  provides  a  measure  of 
frequency-dependent  ILD. 

The  ITF  can  be  estimated  by  taking  the  ratio  of  Fourier  transforms  of  the 
signals  received  at  the  left-ear  and  the  right-ear. 

Y(u,9) 


F(u,9)  = 


(3.7) 


X(u,9) 

It  is  important  to  note  that  F(u,  9)  is  independent  of  the  source  spectrum  and 
thus  can  be  used  to  find  the  location  of  any  wideband  source.  This  observation 
can  be  utilized  for  finding  a  simple  way  of  solving  the  sound  localization  problem 
using  a  priori  information.  Suppose  the  actual  interaural  transfer  function  of  the 
head,  F(u>,  9),  is  known  a  priori.  This  information  may  be  obtained  from  a  training 
process.  Later,  in  order  to  estimate  the  direction  of  an  unknown  source  signal,  one 
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can  estimate  the  interaural  transfer  function  of  the  head  from  the  received  signals 
using  (3.7)  and  compare  it  with  the  known  F(u,9).  The  value  of  6  for  which  the 
estimated  interaural  transfer  function  is  ‘closest’  to  the  actual  interaural  transfer 
function  gives  the  direction  of  the  source. 


3.3  The  Localization  System 

The  method  of  sound  localization  described  in  this  section  was  proposed  by  Lim 
and  Duda  [7].  Figure  3.4  shows  the  schematic  diagram  of  the  localization  system. 
The  input  source  signal  received  at  the  ears  are  processed  through  a  cochlear  model. 
The  output  of  the  cochlear  model  is  used  to  obtain  the  binaural  cues,  namely  the 
ITD  and  the  ILD. 

A  common  way  of  calculating  the  ITD  cue  is  to  first  crosscorrelate  the  cochlear 
outputs  of  corresponding  left-ear  and  right-ear  channels  for  different  time-lags  and 
finding  the  time-lag  for  maximum  crosscorrelation.  A  single  interaural  time  dif¬ 
ference  is  arrived  at  by  averaging  the  ITD  values  obtained  for  each  channel.  In 
mathematical  notation,  if  ITDi  represents  the  ITD  for  the  frequency  channel  i. 
then 


ITDi  =  arg  max  V  x^y^k  -  l) 

Imaxilmax] 

(3.8) 

ITD  =  V  ITD, 

N  V 

(3.9) 

where,  Xi(k),  Ui(k)  G  7 Z  are,  respectively,  discrete-time  left-ear  and  righ-ear 
cochlear  outputs  for  channel  i.  The  notation  l  is  the  time-lag  for  crosscorrelation, 
and  lmax  is  the  maximum  time-lag  possible  between  the  signals  received  at  the 
two  ears.  Maximum  time-lag  lmax  =  (A /c)Fs  depends  on  the  distance  between 
the  ears  A,  the  sampling  frequency  Fs,  and  the  propagation  speed  of  the  sound  c. 
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Figure  3.4:  The  Localization  System  (taken  from  Lim  and  Duda  [7]). 
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The  Jeffress  network  [14]  shown  in  Figure  3.5  provides  an  efficient  way  to  compute 
temporal  correlations  at  different  time-lags. 

To  compute  the  ILD,  the  AGC  is  disabled  in  order  to  preserve  information 
regarding  the  amplitude  level  differences  in  the  signals.  The  ILD  spectrum  is 
obtained  by  calculating  the  logarithm  of  the  ratio  of  the  signal  energies  for  the 
corresponding  channels  in  the  left-ear  and  the  right-ear.  The  signal  energy  in  each 
channel  is  estimated  by  computing  the  zero-lag  autocorrelation  of  the  channel 
output. 

ILD i  =  10(logw(Y,xi(k)xi(k))  -  log10(^2  yi(k)yi(k)))  (3.10) 

k  k 

The  vector  [. ITD ,  I  LD\.  •  •  • ,  I LDN]  contains  information  regarding  the  inter- 
aural  transfer  function  and  will  be  known  as  ITF  vector. 

The  ITF  vector  is  an  approximation  to  the  interaural  transfer  function  F(uj.  0) 
as  described  in  the  previous  section.  Firstly,  it  assumes  that  the  phase  in  F(u,6) 
does  not  depend  on  the  frequency,  and  uses  a  single  value  of  ITD  to  represent 
the  phase  information.  Secondly,  the  magnitude  of  F(u,  6)  is  not  computed  us¬ 
ing  Discrete-Fourier  Transforms  (DFTs)  but  from  the  frequency  channels  in  the 
cochlear  model. 

3.3.1  Learning  of  Interaural  Transfer  Function 

The  training  of  the  system  to  learn  the  interaural  transfer  function  is  the  first 
step  towards  estimating  the  direction  of  an  unknown  sound  source.  It  requires  a 
controlled  environment  to  reduce  the  errors  due  to  random  noise.  A  white  noise 
sound  source  is  used  for  the  training  purpose.  In  order  to  compute  the  ITF  vector 
for  angle  9,  the  source  is  placed  at  that  angle.  The  measurements  obtained  from 
the  microphones  are  used  in  equations  (3.8),  (3.9)  and  (3.10)  to  compute  the 
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Figure  3.5:  The  Jeffress  network. 
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ITF  vector  at  angle  9.  Since  the  vector  incorporates  the  information  regarding 
all  the  frequency  channels,  it  is  denoted  by  F(9),  with  slight  abuse  of  notation, 
to  represent  the  interaural  transfer  function  of  the  system  at  angle  0.  The  set 
{F(9),9  G  0},  represents  the  learning  or  the  training  set  of  interaural  transfer 
function,  where  0  denote  the  set  of  all  angles  for  which  the  system  is  trained. 

3.3.2  Estimation  of  direction 

To  estimate1  the  direction,  9 ,  of  an  unknown  source,  the  waveforms  received  at 
the  two  ears  are  processed  through  the  cochlear  model  and  the  interaural  transfer 
function  represented  by  the  ITF  vector  F  is  estimated.  The  vector  F  may  be 
different  from  the  vectors  in  {F(9),9  G  0}  because  of  the  random  noise  and 
the  variation  of  the  location  of  the  source  from  the  angles  in  0.  The  maximum 
likelihood  (ML)  approach  is  followed  to  estimate  0.  Under  standard  assumptions 
of  independence,  additive  Gaussian  noise  and  arbitarily  large  training  set,  the  ML 
method  says  that  the  best  estimate  is  given  by  the  following  expression 

9  =  argmin(||F  —  F(0)||2),  0G0  (3-11) 

6 

The  authors  of  [7]  also  call  it  as  the  nearest-neighbor  estimator  as  it  involves 
finding  the  a  vector  in  {F(9),  9  G  0}  which  is  closest  to  F  in  the  sense  of  Euclidean 
distance. 

throughout  this  thesis,  the  notation^"  is  used  to  denote  the  corresponding  quantities  in  the 
estimation  phase  to  be  differentiated  from  the  theoretical  values  and  the  quantities  in  the  training 
phase. 
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3.4  Implementation  in  Real-Time 


The  algorithm  was  implemented  on  a  Coreco  board  in  C.  A  cochlear  model  with 
129  channels  was  used.  The  implementation  of  129  channels  required  a  large 
number  of  computations  and  memory.  So,  three  DSPs  were  used  with  43  channels 
implemented  on  each  of  them.  The  fourth  DSP  was  used  for  implementing  the  ML 
nearest-neighbor  estimator.  The  system  can  be  run  in  two  modes,  the  estimation 
mode  and  the  learning  mode. 

•  Estimation  Mode:  In  the  estimation  mode,  the  audio  signals  emmitted  by 
a  source  is  received  by  the  microphones  and  sent  to  the  Coreco  board  which 
computes  the  vector  F .  The  estimator  implemented  on  the  fourth  DSP  picks 
up  this  vector,  compares  it  with  the  training  data  and  gives  out  the  angle 
corresponding  to  the  closest  match  as  the  estimated  direction  of  the  source. 

•  Learning/training  mode:  The  user  can  switch  the  system  from  the  esti¬ 
mation  mode  to  learning  mode  from  the  robot  console.  The  whole  process 
of  learning  interaural  transfer  function  is  automated;  except  that  the  system 
assumes  that  a  broadband  sound  source  is  present  at  azimuth  angle  equal 
to  zero.  As  soon  as  the  user  specifies  switching  to  learning  mode,  the  old 
data  buffers  in  the  Coreco  system  are  flushed,  and  the  program  switches  to 
learning  mode.  On  the  robot  side,  the  robot  console  program  instructs  the 
robot  to  rotate  in  incremental  steps  (the  step  size  in  degrees  can  be  specified 
by  the  user).  The  white  noise  sound  data  emitted  from  a  speaker  is  recorded 
at  different  angles  and  is  sent  to  Coreco  board  for  further  processing.  At  the 
same  time,  the  robot  also  sends  out  the  angle  that  the  robot  is  making  with 
the  sound  source  (computed  using  the  step  size).  This  angle  is  needed  so  that 
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it  can  be  attached  as  a  tag  to  the  corresponding  ITF  vector  in  the  training 
set.  Proper  care  is  taken  to  keep  coordination  between  the  robot  and  the 
Coreco  board  during  the  transfer  of  the  sound  data  from  the  robot  and  the 
computation  of  ITF  information  at  the  Coreco  board  in  order  to  ensure  that 
the  sound  data  used  for  the  computation  matches  with  the  angle  associated 
with  it. 

The  Coreco  board  presented  severe  constraints  on  the  availability  of  memory; 
both  in  terms  of  the  program  size  as  well  as  the  data  size.  In  fact,  it  was  so  severe 
that  it  was  not  possible  to  hard  code  the  coefficients  of  the  cochlear  filter  in  the 
program.  This  made  the  program  so  large  that  it  would  not  fit  into  the  program 
memory.  To  get  around  this  problem,  ‘Read  hie’  utility  provided  by  NETSRV 
program  was  utilized.  A  MATLAB  hie  was  written  that  generated  three  binary 
hies  for  each  of  the  three  DSPs  on  which  the  cochlear  hlters  were  implemented. 
The  hies  consisted  of  a  header  followed  by  the  hlter  coefficients.  The  information  in 
the  header  was  used  to  dynamically  allocate  memory  where  all  the  hlter  coefficients 
were  stored.  Pointers  to  special  data  structures  were  utilized  to  retrieve  the  proper 
coefficients  of  a  cochlear  hlter. 

Due  to  the  high  order  of  the  cochlear  hlters,  circular  buffers  were  used.  In 
general,  implementation  of  digital  hlters  requires  shift  registers  to  realize  the  delay 
lines.  The  disadvantage  of  shift  registers  is  that  every  time  a  new  sample  comes 
in,  the  data  in  the  shift  register  needs  to  be  shifted  to  accomodate  the  new  sample. 
This  process  of  shifting  the  registers  reduces  the  efficiency.  In  circular  buffers,  on 
the  other  hand,  the  new  data  simply  overwrites  the  oldest  data.  The  TI  ’C6701 
DSP  processor  provides  hardware  support  for  the  circular  buffers.  To  utilize  this 
facility,  the  hlters  were  implemented  in  Assembly  language  which  resulted  in  faster 
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execution  of  the  filtering  operation. 


3.5  The  Stereausis  Algorithm 

The  stereausis  model  was  proposed  by  Shamma  et.  al  [1],  It  is  a  two-dimensional 
model  and  measures  the  binaural  cues  by  detecting  the  instantaneous  disparities 
in  the  cochlear  responses  along  the  basilar  membrane  in  the  two  ears. 

At  any  instant  of  time,  the  outputs  of  a  cochlea  can  be  viewed  as  a  snapshot 
of  a  traveling  wave  in  the  basilar  membrane.  The  stereausis  model  utilizes  the 
disparities  in  the  two  traveling  waves  of  the  ears  to  compute  the  binaural  cues. 
For  instance,  the  interaural  delay  in  the  sound  signals  received  by  the  two  ears  will 
produce  traveling  waves  such  that  one  is  phase-shifted  in  one  ear  relative  to  the 
other  (see  Figure  3.6(b)).  In  other  words,  the  snapshots  of  the  waves  traveling  along 
the  basilar  membrane  will  appear  shifted  in  space.  Similarity,  the  interaural  level 
difference  due  to  the  HRTFs  will  produce  relative  disparities  in  the  amplitudes 
of  the  traveling  waves.  Such  differences  are  known  as  spatial  disparities  in  the 
stereausis  model.  Figure  3.5  shows  the  stereausis  network  that  is  used  to  measure 
binaural  differences  from  the  spatial  disparities. 

The  channel  outputs  of  the  cochlear  model  is  fed  into  the  network  which  pro¬ 
duces  a  2-D  image  or  a  matrix.  Both  the  axes  of  the  image  represent  the  char- 
acterstic  frequencies  (CF)  of  the  channels.  The  elements  of  the  image,  cr].  are 
computed  by  cross-correlating  the  outputs  of  the  ith  frequency  channel  of  the  left 
ear  with  that  of  the  j th  frequency  channel  of  the  right  ear.  If  C(-,  •)  represents  the 
cross-correlation  function,  then 


Cij  =  C(xi,yj ) 


(3.12) 
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Figure  3.6:  Traveling  waves  in  the  basilar  membrane  of  the  cochlea  (adapted  from 
Shamma  et.  al  [1]). 
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Figure  3.7:  The  stereausis  representation  (adapted  from  Shamma  et.  al  [1]). 
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where  Xi  and  y:)  represents  the  instantaneous  outputs  of  zth  and  jth  channels 
of  the  left-ear  and  the  right-ear  respectively.  The  stereausis  images  are  computed 
over  a  period  of  time  and  then  averaged.  Figure  3.8  shows  typical  stereausis  images 
for  an  input  signal  which  is  a  mixture  of  600,  800,  1000  and  1200  Hz  tones.  The 
valleys  and  ridges  in  the  image  represent  a  measure  of  the  output  activity  with 
the  regions  of  ridges  (darker  regions)  depicting  high  correlation  activity  and  vice 
versa. 

There  are  two  important  axes  of  information  in  the  stereausis  image 

•  The  Disparity  or  Lateralization  Axis:  The  axis  normal  to  the  diagonal  AB  in 
Figure  3.8  is  called  the  disparity  or  the  laterization  axis  (represented  by  CD). 
The  correlation  activity  along  the  phase  disparity  axis  shows  the  disparity 
between  the  left  and  right  channel  signals. 

The  stereausis  network  systematically  correlates  the  cochlear  responses  Xi  at 
a  given  CF  location  i  in  one  ear  with  outputs  y.j  from  CF  (j  =  i )  and  off-CF 
(  j  ^  i )  cochelar  locations  of  the  other  ear.  Since  the  off-CF  cochlear  outputs 
represent  the  delayed  versions  of  the  responses  at  CF,  the  elements  along  a 
diagonal  parallel  to  AB  represent  the  correlation  between  the  cochlear  output 
of  one  ear  and  the  spatially  shifted  cochlear  output  of  the  other  ear.  In  other 
words,  the  diagonals  represent  the  correlation  of  the  two  cochlear  images  at 
different  horizontal  spatial  shifts. 

The  distance  of  the  correlation  activity  from  AB  signifies  the  amount  of  spa¬ 
tial  disparity  between  the  left-ear  and  the  right-ear  signals.  Since  this  spatial 
disparity  can  be  interpreted  as  the  temporal  disparity  too,  the  disparity  axis 
is  important  for  ITD  cues.  Figure  3.8(b)  shows  the  stereausis  image  for  an 
off-centered  signal  input.  Comparing  it  with  Figure  3.8(a),  we  see  that  the 
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(a) 


(b) 

Figure  3.8:  Stereausis  images  (a)  source  at  0°  (b)  source  at  22.5°. 
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time  delay  results  in  the  shifting  of  the  ridges  along  the  disparity  axis. 


•  The  Spectral  (CF)  Axis:  The  spectral  axis  is  the  axis  parallel  to  the  diagonal. 
This  axis  provides  information  on  the  spectral  content  of  the  signal.  In  Fig¬ 
ure  3.8,  we  see  high  activity  in  the  channels  corresponding  to  the  frequencies 
in  the  mixture  of  tones. 

It  can  be  shown,  under  some  assumptions,  that  the  diagonals  close  to  the 
center  approximate  the  temporal  correlation  methods  for  ITD  information.  Indeed, 
consider  the  correlation  function  as  follows 

Cij  =  C(xi,yj)  =  ^2xi(k)yj(k)  (3.13) 

k 

Assume  that  a  narrow-band  signal  of  low  frequency  u  impinges  on  the  ear 
drums.  Ignoring  the  nonlinear  structures  of  HWR  and  AGC,  the  output  signals 
for  the  A th  channel  of  the  left-ear  and  the  j-tli  channel  of  the  right-ear  can  be 
expressed  as 


Xi(k) 

=  Ai{oj)  sin (cukT  +  cq(u;)) 

(3.14) 

Vj(k) 

=  Aj(u)  sin(o )kT  +  aij(u)) 

(3,15) 

where  Ai(u)  and  Aj(u)  are  the  amplitude  responses  and  cq(u;)  and  oej(u)  rep¬ 
resent  the  phase  transformations  due  to  cochlear  filters  at  locations  i  and  j,  re¬ 
spectively.  The  sets  {xi(k),i  =  1,  •  •  • ,  N}  and  {y]{k),j  =  1,  •  •  • ,  N}  represent  the 
snapshots  of  the  traveling  waves  at  time  k.  Further,  assume  that  the  channels  i 
and  j  have  close  characteristic  frequencies.  Then,  from  the  shape  of  the  cochlear 
filters  (Figure  3.2)  it  can  be  seen  that  for  low  frequencies,  the  magnitude  responses 
of  the  filters  close  together  in  space  are  nearly  same2.  Thus,  we  can  assume  that 
2It  is  interesting  to  note  that  while  the  stereausis  algorithm  utilizes  the  shape  and  the  overlap 
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Ai(u)  ~  Aj(u).  Next,  defining  aj(u)  =  cq(u;)  —  8a(u),  we  get 

Dj(k )  ~  Ai(u)  sin((u;/cT  +  cq(o>)  —  Sa(u}))  (3.16) 

Again,  assuming  that  the  velocity  of  the  traveling  waves  is  constant  over  the 
small  distance  between  the  i-th  and  j'-th  locations  on  basilar  membrane,  the  phase 
difference  8a(u>)  can  be  expresssed  as 

5a(u>)  =  ojts  (3-17) 

where  rs  is  the  time  taken  by  the  traveling  to  travel  between  the  two  locations, 
i  and  j.  Thus,  yj(k)  can  be  re-written  as 

Vj(k)  ~  Ai(u)  sm((ukT  +  ai(u)  —  uts)  (3.18) 

=  Ui{kT  -  ts)  (3.19) 

Thus  yj{k)  is  a  delayed  version  of  y, ( k).  The  spatial  correlation  as  dehned 
above  is  therefore  equivalent  to  the  temporal  correlation 

Cij  =  Yhxi{k)yi{k  -  ts)  (3.20) 

k 

The  above  analysis  shows  that  as  long  as  the  two  channels  are  not  far  enough, 
the  elements  of  the  image  represent  a  measure  of  temporal  correlation  that  can  be 
used  to  measure  the  interaural  time  difference.  In  other  words,  the  elements  that 
are  close  to  the  center  diagonal  AB  are  important  for  the  detection  of  ITD  cue. 

A  simple  method  similar  to  the  temporal-correlation  method  was  used  to  mea¬ 
sure  the  ITD  from  the  spatially  correlated  outputs  from  the  stereausis  network. 
The  elements  along  a  diagonal  were  summed  together.  The  sum  represented  the 

of  the  cochlear  filters,  the  algorithm  by  Lim  and  Duda  [7]  ignores  the  overlap  and  treats  the 
filters  as  approximations  to  DFT. 
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correlation  between  spatially  shifted  cochlear  outputs.  The  sums  for  different  diag¬ 
onals  represented  the  correlation  at  different  spatial  shifts.  They  were  then  plotted 
along  the  disparity  axis  and  after  a  post  processing  to  suppress  dual  peaks  in  the 
plot,  the  peak  was  searched.  The  distance  of  the  peak  from  the  diagonal  was  used 
as  a  measure  of  ITD. 

The  ILD  cue  depends  on  the  kind  of  the  correlation  function  C(-,  •)  used  for 
forming  the  spatial  image.  We  now  show  that  the  multiplicative  correlation  func¬ 
tion  used  above  does  not  provide  good  ILD  cues.  Assume  that  we  have  a  constant 
ILD,  so  that  the  cochlear  output  for  the  right-ear  is  just  a  scaled  value  of  the 
left-ear  output,  i.e, 


Vi(k)  =  axi(k ) 

(3.21) 

Vj(k)  =  axj(k) 

(3.22) 

where  a  is  the  scalar  factor  representing  ILD.  Then  for  the  multiplicative  cor¬ 
relation  function,  cy,  and  cJt  are  given  by 

Cij  =  y;  axi(k)x1(k)  (3.23) 

k 

Cji  =  y  axi(k)xj(k)  (3.24) 

k 

As  is  obvious  from  the  above  equations,  ctJ  =  Cji.  Therefore,  such  a  correlation 
function  will  not  provide  any  asymmetry  around  the  diagonal  AB  which  can  be  used 
to  detect  the  ILD  cue.  Different  correlation  functions  such  as  addition  C(xi,yj)  = 
J2k(xi(k)  +  Vj(k))  can  be  used  instead  [1]. 

In  this  thesis,  for  the  purpose  of  measuring  the  ILD  we  follow  the  same  method¬ 
ology  as  in  Section  3.3.  A  vector  consisting  of  ILD  values  is  formed  by  taking  the 
ratio  of  signal  energies  in  each  of  the  channels. 

ILDi  =  10(log10(y  Xi(k)xi(k))  -  log10(y  yi(k)yi(k)))  (3.25) 

k  k 
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The  ITD  and  the  ILDs  are  used  to  form  the  ITF  vector.  The  rest  of  the 


procedure  for  training  and  estimating  the  direction  remains  the  same  as  in  the 
temporal  correlation  case3. 


3In  this  thesis,  we  have  used  a  simple  method  for  binaural  processing.  The  stereausis  image 
is  highly  informative  and  a  much  more  sophisticated  processing  can  be  used  to  extract  the  sound 
localization  information.  Please  refer  to  [12,  13]  for  more  details. 
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Chapter  4 


Subspace  Methods 

4.1  Introduction 

The  algorithms  described  in  the  previous  chapter  are  based  on  matching  of  mea¬ 
sured  interaural  transfer  function  with  a  known  set  of  interaural  transfer  functions. 
These  algorithms  inspired  us  to  explore  statistical  signal  processing  tools  to  com¬ 
pute  the  interaural  transfer  functions  and  follow  the  same  procedure  of  exhaustive 
search  for  finding  the  closest  match.  The  statistical  methods  provide  effective 
techniques  to  tackle  measurement  noise  inevitable  in  all  practical  systems. 

A  simple  way  of  measuring  the  interaural  transfer  function  is  to  compute  the 
short-time  DFT  coefficients  of  the  signals  received  at  the  two  sensors  and  take  their 
ratios  which  will  give  the  short-time  DFT  coefficients  for  the  interaural  transfer. 
One  can  then  average  these  coefficients  over  a  period  of  time  to  get  the  statistical 
mean.  This  technique  is  akin  to  averaged  periodogram  methods  which  have  been 
shown  to  perform  poorly  in  comparison  to  parameteric  methods  [8] .  Thus,  we  have 
used  parameteric  subspace  methods,  specifically  MUSIC  and  ESPRIT,  as  the  main 
tools  for  the  spectral  estimation  of  the  interaural  transfer  function. 


40 


To  get  started,  we  will  first  derive  the  data  model  for  the  subspace  methods. 
After  that,  the  genesis  of  subspace  methods,  MUSIC  will  be  described  and  the 
concept  of  signal  subspace  will  be  explained.  Later,  we  will  talk  about  ESPRIT 
and  the  methods  to  tackle  the  problem  at  hand. 

4.2  The  Data  Model 

The  popular  DOA  methods  including  the  subspace  methods  assume  that  the  im¬ 
pinging  signals  are  narrowband.  However  this  assumption  is  not  true  in  the  case  of 
sound  signals.  For  the  ease  of  presentation,  the  narrowband  model  is  first  derived 
and  then  extended  to  form  the  wideband  model. 

4.2.1  The  Narrowband  Model 

A  number  of  assumptions  are  made  to  simplify  the  derivation  of  the  model  equa¬ 
tion.  Some  of  these  assumptions  are  given  below. 

The  transmission  medium  is  assumed  to  be  homogenous  and  isotropic.  The 
sources  are  assumed  to  be  in  the  far  held  of  the  array.  Hence  the  signals  received 
by  the  sensors  are  plane  waves.  The  signals  are  assumed  to  be  sample  functions 
of  narrowband  stationary  processes  with  center  frequencies,  uy,  for  the  i-th  signal. 
Thus,  the  i-th  signal  Si(t)  G  C  can  be  written  as 

Si(t)  =  Ui(t)ej{u,it+Vi (t))  (4.1) 

where  iq(t)  and  Vi(t)  are  “slowly  varying”  functions  of  time  such  that  for  small 
propagation  delays  r*,  the  following  conditions  are  true 

Ui(t  -  Ti )  «  Ui(t) 

Vi(t  -  Ti)  «  Vi(t) 
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Then, 


Si(t  -  Ti) 


Ui{t  -  r<)eJ'(wi(t_Ti)+,,i(t"Ti)) 


«  Ui(t)ej^t+Mt))e~juJ'Ti 


which  can  be  written  as 


Si{t  -  Ti)  «  Si(i)e  (4.2) 

Thus,  the  effect  of  small  time  delays  for  narrowband  signals  is  simply  a  phase- 
shift.  Regarding  the  sensor  array,  we  assume  that  the  sensor  elements  and  the 
head-related  response  can  be  modeled  as  linear  time- invariant  systems  having  linear 
transfer  functions.  It  is  important  to  mention  that  these  transfer  functions  have 
spatial  characterstics  which  means  that  the  transfer  functions  may  be  different  for 
signals  arriving  from  different  directions. 

With  above  assumptions  in  place,  let  us  consider  L  narrowband  signals  with 
known  center  frequencies  {oJi)f=l  impinging  on  an  array  of  M  sensors  from  direc¬ 
tions  {&i}f=1-  Since  the  signal  i  is  sampled  both  in  time  and  space  (by  spatially 
distributed  sensors),  it  is  important  to  specify  it  in  both  parameters.  So  let  us 
denote  Si(t)  as  the  value  of  x-tli  signal  waveform  at  a  reference  point  in  space,  at 
time  t.  The  reference  point  is  normally  one  of  the  sensors  in  the  array. 

Let  Tki  be  relative  delay  of  the  x-t  li  waveform  in  reaching  sensor  k.  Then,  using 
the  superposition  principle,  the  output  of  sensor  k  can  be  written  as 

L 

%k  if)  =  J2  hki(t)  *Siit~  Tki)  +  ek(t)  (4.3) 

i= 1 

where  hki(t)  is  the  combined  impulse  response  of  the  sensor  k  and  the  head 
(HRTF)  to  signal  %.  The  impulse  response  depends  on  the  direction-of- arrival  of 
the  signal,  and  hence  the  subscript  i  in  the  impulse  response.  ek(t)  is  the  additive 
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measurement  noise.  Assuming  that  the  propagation  delay,  r^,  is  small  over  the 
extent  of  the  array  such  that  (4.2)  holds,  the  above  equation  can  be  re-written  as 

L 

Zk(t )  =  Ylhki{t)  *  Si(t)e~3U)iTki  +  ek(t)  (4.4) 

i= 1 

This  equation  can  be  simplified  further  using  the  narrowband  assumption. 
Since  the  spectrum  of  Si(t )  is  centered  around  uji  and  falls  off  rapidly  for  increasing 
\u)  —  uji\,  the  convolution  with  h^iit)  can  be  replaced  by  multiplication  with  complex 
gain  Hki(u>i)  at  frequency  u>i.  Thus,  (4.4)  becomes 

L 

Zk(t)  =  J2Hki(UJi)Si(t)e~JUJiTki  +  ek(t)  (4-5) 

i= 1 

Next,  we  introduce  a  few  vector  notations  to  facilitate  writing  the  ouput  equa¬ 
tions  for  all  M  sensors  in  a  compact  form. 

1.  The  output  vector 

z(t)  =  [z1{t),...,zM(t)]T  (4.6) 

2.  The  signal  vector 

s{t)  =  [si(t),...,sL(t)]T  (4.7) 

3.  The  additive  measurement  noise  vector 

e{t)  =  [ei  (t),...,eM(f)]T  (4.8) 


4.  The  array  steering  column  vector 


a(0i) 


'Hu(ui)e 


-JUiTu 


•  HMi(ui)e 


i  =  (4.9) 


The  center  frequencies,  ujt,  are  assumed  to  be  known.  The  gain  at  microphone 
k  to  signal  i  at  frequency  a;*,  Hki(oOi)  depends  only  on  the  angle-of-arrival  9i 
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of  the  z-tli  signal.  Further,  if  the  array  geometry  is  assumed  to  be  known, 
the  propagation  delays,  r^,  depend  only  on  the  9t.  Thus,  the  array  steering 
vector  is  a  function  of  d*  only. 

5.  The  array  steering  matrix 

A(8)  =  [a(e1),...,a(eL)]  (4.10) 

Using  the  above  vector  notations,  we  can  combine  the  output  equations  for  all 
M  sensors  and  write  the  output  data  model  for  narrowband  signals  as 

z(t)  =  A(9)s(t)  +  e(t)  (4-11) 

4.2.2  The  Wide-band  Model 

We  now  assume  that  the  source  signals  impinging  on  the  array  are  wide-band. 
Using  the  same  notation  as  in  the  case  of  narrow-band  sources,  the  signal  received 
at  the  k-th  sensor  can  be  expressed  as 

L 

Zk (t)  =  hki(t)  *Si(t-  Tki)  +  ek(t)  (4.12) 

1=1 

Unlike  the  narrow-band  case,  it  is  more  convenient  to  represent  the  model  in 
frequency  domain.  Assume  that  the  source  signals  and  the  received  signals  have  a 
Fourier  series  representation,  then  the  above  relation  can  be  expressed  as 

L 

Zk{u,6)  =  Y.HkMe-^S^  +  Ektu)  (4.13) 

i= 1 
L  _ 

=  +  (4.i4) 

i= 1 

where,  Hki(u>)  =  Hki{ijo)e~^Tki .  In  matrix  notation,  the  above  equation  be¬ 
comes 

Z(w,d)  =  A(u,d)S(w)  +  E(u)  (4.15) 
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where, 

Z{v,0)  =  [Z1(o;,0),...,ZM(u;,d)]T 
S(  u)  =  [S1(u),...,Sl(v)]t 
E(u)  =  [E1(u),...,Em(u)]t 

and  the  matrix  A(u>,  9)  is  given  by 

#n  M  •••  #ilH 

H2i(uj)e~ju}T21  •••  H2L(uj)e~ju}T2L 

Hm  i(uj)e~ju}TM1  ■■■  HML(w)e~ju,™j 

Observe  that  each  column  of  frequency-domain  array  steering  matrix  A(u,  9)  is 
associated  with  a  different  source.  The  subspace  spanned  by  array  steering  matrix 
is  called  signal  subspace.  A  quick  look  at  (4.15)  shows  that  in  the  absence  of  noise 
term  E(u),  the  output  vector  belongs  to  the  subspace  of  matrix  A(u,0).  This 
concept  is  elaborated  upon  in  the  next  section.  Further,  note  that  the  columns 
of  A(u,  9)  span  different  spaces  at  different  frequencies  even  if  the  sensors  and 
HRTF  have  flat,  omni-directional  frequency  response.  This  property  makes  it 
extremely  difficult  to  combine  the  subspaces  of  different  frequencies  for  coherent 
DOA  estimation. 

4.3  MUSIC 

MUSIC  [5]  algorithm  was  proposed  by  R.  O.  Schmidt.  It  is  derived  using  the 
correlation  structure  of  the  output  data.  Consider  the  wide-band  model  (4.15), 
reproduced  here  for  convenience. 

Z(w,G)  =  A(u,  9)S(co)  +  E(u>)  (4.20) 


A(u,0)  = 


(4.16) 

(4.17) 

(4.18) 


(4.19) 
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Assuming  that  the  vector  S(u)  consists  of  complex  envelope  of  L(<  M )  uncor¬ 
related  zero-mean  source  signals  St(uj)  at  frequency  u  and  the  frequency-domain 
additive  measurement  noise  vector  E(uS)  is  white  with  zero  mean  and  variance  a2, 
we  can  express  the  correlation  matrix  of  Z(u,  6)  as  follows 

R(uj,6)  =  A(ou,9)Rs(uj)AH(uu,9)  +  a2I  (4.21) 

where  I  is  the  M-hj-M  identity  matrix,  Rs(u)  is  the  correlation  matrix  of  signal 
vector  S(u),  and  the  superscript  H  denotes  transpose  and  complex  conjugation. 
Since  the  signals  St(uj)  are  uncorrelated,  Rs(u)  is  a  diagonal  matrix 

Rs(u)  =  diag{Pi(a;), . . . ,  PL(u>)}  (4.22) 

where  Pt(uj)  =  E,[|lSi(u;)|2],  i  =  1, . . . ,  L  is  the  spectral  power  density  of  the  i-th 
signal. 

To  this  end,  let  Ai  >  A2  >  . . .  >  A m  denote  the  eigenvalues  of  the  correlation 
matrix  R(u),  and  V\  >  Vo  >  ...  >  vm  denote  the  eigenvalues  of  the  matrix 
A(ui,  6)Rs{oj)Ah (tx),  6),  then  from  (4.21),  we  get 

A  i  =  Vi  +  a2,  i  =  l,2,...,M  (4.23) 

Now,  since  Rs(u)  is  a  diagonal  matrix  with  rank  L,  the  smallest  (M  —  L ) 
eigenvalues  of  A(ou,  9)Rs(u>)AH (cu,  6)  are  zero,  i.e,  vl+i  =  ...  =  vm  =  0.  Rewrite 
(4.21)  in  the  following  form 

R(u>,  9)  —  a2 1  =  A(uj,6)Rs(uj)AH(uj,d)  (4.24) 

Then,  if  {qi(u>,9),l  =  1,  •  •  • ,  M}  denote  the  eigenvectors  of  R(u>,  9),  we  get 

{R{u,6)-a2I)qi{w,e)  =  A(tv,6)Rs(u;)AH(Lu,9)ql(iu,6)  (4.25) 
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Equivalently, 


\iqi(uj,6)  -  a2qi(uj,6)  =  A(lu ,  9) Rs(cu) AH (u,9)qi(u,0)  (4.26) 

From  the  above  discussion,  it  follows  that  the  eigenvectors  of  R(u>,  9)  associated 
with  the  smallest  (M  —  L)  eigenvalues  satisfy  the  following  relationship 

A(u,9)Rs(uj)AH(tu,9)ql(uj,9)  =  0,  l  =  (4.27) 

Since  the  matrix  Rs(uj)  is  a  real- valued  diagonal  matrix  of  full  rank,  it  follows 
from  the  above  equation  that 

AH(u,9)qi{u,9 )  =  0,  l  —  L  +  1, . . . ,  M  (4.28) 

or  equivalently  from  the  definition  of  Ah(uj,  9), 

aH  (u,9i)qi(u},9)  =  0,  l  =  L  +  1, . . . ,  M,  i  =  (4.29) 

If  we  define  Qn(oj,9)  and  Qs(w,9)  as  follows 

Qn(uj,9)  =  [qL+1(u,9),...,qM(v,9)],  (4.30) 

Qs(uj,9)  =  [g!(w,0),...,gL(o;,0)]  (4.31) 

Then  from  (4.29),  we  get 

Qn(uj,  9)a(u,  9i)  =  0,  i  =  l,...,L  (4.32) 

From  the  above  equation,  we  make  the  following  key  observation.  The  DOAs, 
9i,  are  the  roots  of  the  following  equation 

aH(uj:9i)QN(iv,9)Q^(<uj,9)a(uj19i)  =  0,  i  —  (4.33) 
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4.3.1  Signal  and  Noise  Subspaces 

From  the  relation  (4.32),  we  note  that  the  steering  vector  a(ui,9i)  belongs  to  the 
null  space  of  Qn(u,  9 )  which  is  denoted  by  a(u>,  9,-')  e  J\f(QN(u ;,  9)).  Also,  since  the 
correlation  matrix  R(u>,  9)  is  hermitian,  its  eigenvectors  are  mutually  orthogonal 
and  hence, 

Q%{cu,9)Qs(lo,9)  =  0  (4.34) 

Thus  if  we  denote  the  range  space  of  Qs(w,  9)  by  lZ(Qs(tu,  9)), 

K{Qs(u,9))  =  A f{QN{uj,9))  (4.35) 

and  hence, 

a{u,0i )  e  7 l(Qs(to,9))  (4.36) 

The  range  space,  TZ(Qs(u>,  9)),  spanned  by  the  hrst  L  eigenvectors  (associated 
with  the  L  largest  eigenvalues)  of  the  output  correlation  matrix  R(u,  9)  is  called 
the  signal  subspace  and  the  space,  TZ(Qn(lj,9)),  spanned  by  the  last  (M  —  L ) 
eigenvectors  is  called  the  noise  subspace. 

4.3.2  Direction-of- Arrival  Estimation 

The  equation  (4.33)  provides  a  straightforward  way  of  estimating  the  directions. 
Suppose  a(u 9)  is  known  for  the  complete  range  of  u>  and  9  and  the  noise  sub¬ 
space,  Qn(u,-),  obtained  from  the  data  received  from  the  microphones.  Then,  an 
exhaustive  search  is  done  to  find  L  array  steering  vectors  which  are  most  orthog¬ 
onal  to  the  noise  subspace,  i.e,  the  L  vectors  which  provide  the  least  values  to  the 
expression  aH(ui,  9i)QN(iv,  -)a(uj,  9j).  The  angles  corresponding  to  the  L 

most  orthogonal  array  steering  vectors  are  the  estimated  directions  of  the  sources. 
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The  determination  of  array  steering  vectors,  a(u,0),  however  requires  precise 
knowledge  of  HRTFs  and  the  sensor  responses  and,  thus,  is  a  long,  time-consuming 
process.  It  is  preferable  to  use  (4.34)  instead  of  (4.33)  since  the  signal  subspace 
can  be  determined  using  a  similar  system  set-up  as  needed  for  estimation.  We  now 
describe  a  practical  implementation  of  localization  system  based  on  MUSIC. 

Training  Phase 

As  in  the  biological  algorithms,  we  go  through  a  training  process  to  determine 
the  training  set  of  signal  subspace  Qs(uj,9)  for  different  values  of  0  G  0.  Due  to 
practical  limitations,  the  training  set  can  only  be  recorded  for  discrete  frequencies 
ui  =  ujn.  Hence,  we  denote  the  training  set  consisting  of  signal  subspaces  as 
{Qs(wn,  0),LJn  G  B,  0  G  0},  where  B  is  the  set  of  discrete  frequencies  covering  the 
bandwidth  of  the  signals. 

A  broadband  white  noise  is  transmitted  and  the  data  received  by  the  micro¬ 
phones  is  recorded  at  different  angles  of  arrival  of  the  white  noise.  For  each  angle 
9,  the  data  in  time  series  is  converted  to  frequency  domain  by  taking  the  DFT  for 
frequencies  ujn.  The  correlation  matrix  R(u)n,  9)  is  formed.  The  eigenvector  cor¬ 
responding  to  the  largest  eigenvalue  of  R(ujn,  6)  is  the  estimated  signal  subspace 
Qs(ujn,9 )  at  angle  9.  Such  a  set  of  vectors  representing  signal  subspaces  form  the 
training  data  and  is  later  used  during  the  estimation  of  the  directions  of  unknown 
sources. 

Estimation  Phase 

The  important  steps  in  the  estimation  phase  of  MUSIC  algorithm  are  summarized 
as  follows: 
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1.  Convert  the  time  series  of  output  data  obtained  from  the  sensors  into  fre¬ 
quency  domain,  Z(un,  9),  ivn  G  B.  The  notation  9  =  [6\,  ■  ■  ■ ,  #l]  is  the  set  of 
unknown  directions. 

2.  Estimate  the  frequency  domain  correlation  matrix 

R(u>n,9)  =  ZH(un,9)Z(un,6) 


3.  Compute  the  eigenvectors  of  R(u>n,9)  using  eigendecomposition  methods. 

4.  Estimate  the  matrix  QN(cjn,9)  from  the  (M  —  L)  eigenvectors  associated 
with  the  smallest  (M  —  L)  eigenvalues1. 


5.  Assuming  that  the  training  process  is  already  completed,  determine  the 
angles-of- arrivals,  9i,  by  searching  for  the  subspace  vectors  which  are  or¬ 
thogonal  to  the  noise  subspace.  In  spectral  MUSIC,  the  search  is  performed 
by  plotting  the  following  expression,  known  as  pseudo-spectrum,  over  the 
domain  of  9. 


J  9^) 


1 

Qs  (w»)  9)QN(ujn,  9)Q%(un,  9)Qs(tvn,  9) 


(4.37) 


Then  it  follows  from  (4.34),  that  the  angles,  9t,  can  be  estimated  as  the  peaks 
in  the  plot  of  J(un,  9). 


6.  Since  each  frequency  u>n  will  provide  a  pseudo-spectrum  J(u>n,9 )  and  its 
own  set  of  directions,  a  scheme  is  needed  to  combine  the  results  of  all  the 

Tn  the  above  analysis  we  assumed  that  L  is  known.  But  there  are  many  practical  situations 
in  which  the  number  of  sources  is  unknown.  In  that  case,  we  need  to  estimate  L.  One  way  to  do 
this  is  to  calculate  the  eigenvalues  of  R( ui,  0)  and  taking  the  number  of  eigenvalues  greater  than 
a  pre-defined  threshold  as  the  number  of  sources. 
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frequencies.  The  approach  of  “averaging”  the  pseudo-spectrums  is  followed 
in  order  to  take  into  account  all  the  frequencies.  The  final  estimates  of 
directions  are  obtained  from  the  peaks  of  the  average: 


m 


i 

Qs  ^)Qjv(wn,  @)Qn( uni  6)Qs(uni  6) 


(4.38) 


4.4  ESPRIT 

The  ESPRIT  [6]  algorithm  is  another  popular  subspace  alogrithm  for  direction-of- 
arrival  problem.  It  was  introduced  by  Roy  and  Kailath  and  is  similar  to  MUSIC 
in  that  it  exploits  the  underlying  data  model.  The  ESPRIT  algorithm  can  achieve 
significant  computational  and  storage  cost  advantages  over  MUSIC  by  requiring 
that  the  sensors  occur  in  matched  pairs  with  identical  displacement  vectors.  How¬ 
ever,  in  our  case,  the  sensors  see  different  HRTFs  and  the  transfer  characteristics 
of  the  two  sensors  are  not  same.  A  search  procedure  as  in  the  previous  methods  is 
employed  to  overcome  this  problem  which  reduces  the  computational  advantages 
of  ESPRIT  over  MUSIC.  Still,  the  processing  in  ESPRIT  is  different  from  MUSIC. 
Moreover,  it  is  closer  in  spirit  to  the  biological  algorithms  and  incorporates  the 
concept  of  interaural  transfer  function  naturally. 

Consider  an  array  of  M  sensors  in  which  the  sensors  can  be  grouped  in  doublets 
such  that  the  displacement  between  the  sensor  elements  in  a  doublet  is  constant 
both  in  magnitude  and  direction  for  all  the  doublets.  The  exact  location  of  the 
doublet  pairs  in  the  space  is  not  important. 

It  shall  be  convenient  to  visualize  the  array  as  being  comprised  of  two  subarrays 
Zx  and  Zy,  identical  in  geometry  and  response  characteristics  but  translated  in 
space  by  a  fixed  displacement  vector.  Let  A  be  the  displacement  vector  with 
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magnitude  d.  Under  this  special  array  geometry,  we  redefine  our  data  model  (4.11) 
as  follows. 

Assume  that  the  array  is  illuminated  by  L  wide-band  plane  waveforms.  Let 
Hki{oJ )  be  the  complex  response  of  the  first  sensor  in  the  Mh  doublet  and  the 
corresponding  HRTF  to  the  i-tli  wavefront  incident  from  the  direction  measured 
with  respect  to  the  normal  to  the  displacement  vector  A.  Then,  proceeding  as  in 
previous  section,  the  Fourier  transform  of  the  output  signal  received  at  the  first 
sensor  of  the  Mil  doublet  can  be  expressed  as 

Xk(u,0)  =  +  (4.39) 

i= 1 

where  T/,..,  is  the  propagation  delay  for  waveform  i  from  the  reference  point  to  the 
first  element  in  the  Mth  doublet  and  EXk(uj)  represents  the  additive  measurement 
noise.  Since  the  second  sensor  element  in  the  doublet  is  displaced  further  by  the 
distance  d  from  the  first  element,  the  signal  received  by  the  second  sensor  will  be 
delayed  further  by  time  d  sin  0,  / c,  where  c  is  the  speed  of  propagation  of  sound. 
The  received  signal  at  the  second  sensor  is  given  by 

Yk(u,9)  =  ^Fw(W)e-^F(  u^Je-^^S^  +  EyYu;)  (4.40) 

i= 1 

where  F(uj,8i)  =  F{uj,9i)e~^udsine^c  represents  the  “interaural  transfer  func¬ 
tion”  between  the  two  sensors  in  the  Mil  doublet.  The  additive  measurement  noise 
corresponding  to  EXk(u),  Eyk(uj)  are  uncorrelated  with  signals  and  are  assumed  to 
be  stationary  zero-mean  spatially  white  random  processes  with  a  known  covariance. 

Using  the  matrix  notation  to  express  equations  (4.39)  and  (4.40)  for  k  = 
1, . . . ,  M/2,  the  output  of  the  array  can  be  expressed  as 

X(u,8)  =  A(lu,6)S(cu)  +  Ex(u)  (4.41) 

Y{u,9)  =  A(u,O)$(w,0)S(u)  +  Ey(u)  (4.42) 
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where, 


X(u,  9)  G  CM'2  is  the  output  vector  of  first  sensors  at  frequency  u ; 
Y(u,  9)  G  CM!2  is  the  output  vector  of  second  sensors  in  each  doublet 
A(u,9)  G  CM/2xL  is  an  unknown  matrix  of  array  steering  vectors 
Ex(co),  Ey(u )  G  CM'2  are  the  measurement  noise  vectors 
$(w,e)  =  diag  {|F(u;,di)|e-^1^,...,|F(a;,dL)|e-^^)}, 


$(u;,  9)  is  the  matrix  of  interaural  transfer  functions  with 


9i)  =  UJdsi^()l  4.  arg(F(u;,  0*))  (4.43) 

Next,  combining  equations  (4.41)  and  (4.42),  we  define  the  data  model  for  the 
problem  as 


Z(u,9) 

A(u,9) 


X(u,9 ) 
Y{u,9) 


A(u>,  9)S(u)  +  E(oj) 


A(u,9) 

A(uj,9)$(uj,9) 


E(u) 


Ex{u) 

Ey^) 


(4.44) 


(4.45) 


The  objective  is  to  estimate  the  number  of  signals  L  and  the  directions-of- 
arrival  9t.  For  this  it  is  sufficient  to  estimate  the  matrix  <F(o;,  9).  It  is  the  structure 
of  the  matrix  A(cu,9)  that  is  exploited  to  obtain  $(w,d)  without  having  to  know 
the  exact  HRTF  and  the  sensor  transfer  functions  in  A(u>,9). 

Proceeding  as  in  the  case  of  MUSIC  algorithm,  we  compute  the  correlation 
matrix  R(u,  9)  of  total  array  output  vector  Z(u,9).  The  L  eigenvectors  of  R(u,  9), 
denoted  by  {qi(u,9),  l  =  1,...,L}  corresponding  to  L  largest  eigenvalues  are 
used  to  obtain  the  signal  subspace  7 Z(Qs{w,9))  at  frequency  u>,  where  Qs(oo,9)  = 
[qi(tv,  9), . . . ,  <7l( uj,  9)].  Since  TZ(Qs(w>,  9))  =  9)),  there  must  exist  a  unique 
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nonsingular  matrix  T  such  that 


Qs(u,0)  =  A(u,9)T 


A(cj,d)T 

A(u,9)$(u>,6)T 


Partitioning  Qs(tu,0)  into  Qx  G  CM/2xL  and  Qy  G  CM^2xL,  we  get 


Qx 

1 - 

S' 

i _ 

Qy 

A(u},6)$(u,0)T 

It  follows  from  the  above  equation  that 


K(Qx)  =  K(Qy)  =  K(A(u>,0)) 


Next,  define 


Qxy  —  [Qx\Qy] 

Since  Qx  and  Qy  span  the  same  column  space,  the  rank  of  Qxy  is 
Qxy  G  Cm/2x2L,  the  null  space  of  Qxy  has  dimension  L.  Let  G  G  C2Lx 
L  span  the  null  space  of  Qxy,  denoted  by  M{Qxy),  then 

[Qx\Qy]G  =  0 

Partitioning  GH  =  [Gx\Gy],  where  Gx,  Gy  G  CLxL,  we  get 

QxGX  +  QyGy  =  0 
=>  A(uj,6)TGx  +  A(uj,6)$(uj,e)TGy  =  0 


Since  G  is  rank  L,  the  matrices  Gx  and  Gy  are  nonsingular.  Define 

T  =  -GxGy1 

Then  (4.52)  can  be  expressed  as 

-A(u,0)TV +  A(u,0)$(w,0)T  =  0 


(4.46) 

(4.47) 

(4.48) 

(4.49) 

'.  Since 
of  rank 

(4.50) 

(4.51) 

(4.52) 

(4.53) 

(4.54) 
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Rearranging,  we  get 


A(u;,fl)$(w,0)  =  A(w,  9)T'i>T~1  (4.55) 

Finally,  assuming  A(uj,  9)  to  be  full  rank,  we  get 

$(u ;,0)  =  TTT"1  (4.56) 

Thus  the  eigenvalues  of  T  are  the  same  as  the  diagonal  elements  of  <3>(u;,  9). 

4.4.1  Direction-of- Arrival  Estimation 

Given  the  diagonal  elements  of  $(u;,  9),  the  interaural  trasfer  functions  F(iv,  9i )  can 
be  determined  from  (4.43).  However,  the  term  F{uo,9i)  in  the  interaural  transfer 
function  depends  on  the  response  characteristics  of  the  sensors  and  the  surround¬ 
ings  and  is  generally  unknown;  thus  making  it  impossible  to  directly  find  the 
directions,  9,.  We,  therefore,  follow  the  same  two-phase  scheme  as  in  the  the  case 
of  earlier  methods. 

Training  Phase 

In  the  first  phase,  the  system  undergoes  a  training  process  that  provides  a  set  of 
interaural  transfer  functions,  {F(O') .  9  £  0}.  The  notation  F(O')  represents  the 
vector  {F(ujn,  9),  ujn  £  B}. 

A  broadband  white  noise  is  transmitted  and  the  data  received  by  the  micro¬ 
phones  is  recorded  at  different  angles  of  arrival  of  the  white  noise.  For  each  angle 
9,  the  data  in  time  series  is  converted  to  frequency  domain  by  taking  the  DFT  for 
frequencies  un.  For  each  frequency  un,  the  correlation  matrix  R(oon,  9)  is  formed. 
The  eigenvector  corresponding  to  the  largest  eigenvalue  of  R(ujn,  9)  gives  estimated 
signal  subspace  Qs(un,  9)  at  angle  9.  Then,  the  equations  (4.49),  (4.50),  (4.53)  and 
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(4.56)  are  used  to  compute  F(un,9).  Repeating  the  process  for  all  u>n  G  B  and 
9  G  0  provides  the  training  set. 

Estimation  Phase 

The  computational  steps  in  the  estimation  phase  of  ESPRIT  are  summarized  as 
follows: 

1.  Convert  the  time  series  of  output  data  obtained  from  the  sensors  into  fre¬ 
quency  domain,  Z(u>n,  9),  oon  G  B.  The  notation  9  =  [#i,  •  •  • ,  #l]  is  the  set  of 
unknown  directions. 

2.  Estimate  the  frequency-domain  correlation  matrix  for  the  frequency  bin  uin, 
R(ujnJ)  =  ZH(ujn,  0)Z(con,  9). 

3.  Compute  the  eigenvectors  of  R(u>n,9 )  using  eigendecomposition  methods. 

4.  Estimate  the  matrix  Qs(wn,9)  from  the  L  eigenvectors  associated  with  the 
largest  L  eigenvalues. 

5.  Partition  QsQjn-,9)  =  [Qx\Qy\H  and  form  matrix  Qxy- 

6.  Compute  the  null  space  of  Qxy-,  given  by  G. 

7.  Partition  GH  =  [Gx\Gy]H  and  compute  T  =  —  GxGy1- 

8.  Compute  the  L  eigenvalues  of  matrix  T. 

9.  Case  L  =  1:  In  the  case  of  single  source,  there  will  be  just  one  eigenvalue 
of  T.  Repeating  the  steps  (2-8)  for  all  frequency  bins,  G  R,  a  vector 
of  eigenvalues  is  formed,  represented  by  F.  Assuming  that  the  training  set 
{F(9),9  G  0}  is  available,  the  nearest-neighbor  approach  as  in  Chapter  3  is 
followed  to  estimate  the  location  of  the  source. 
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10.  Case  L  >  1:  When  L  >  1,  the  picture  becomes  much  more  complicated.  In 
the  case  of  more  than  one  sources,  the  ESPRIT  algorithm  will  give  a  set  of  L 
eigenvalues,  F(ujn,  =  1,  •  •  • ,  L},  for  frequency  un.  Repeating  the  steps 

(2-8)  for  all  frequencies  G  B,  similar  sets  of  eigenvalues  are  obtained.  The 
process  of  obtaining  the  eigenvalues  for  one  frequency  bin  is  independent  of 
that  for  another  frequency  bin.  It  is,  thus,  not  clear  how  to  associate  these 
eigenvalues  with  the  sources.  In  order  to  solve  this  problem,  the  nearest 
neighbor  estimator  is  extended  and  a  global  search  on  all  the  possible  asso¬ 
ciations  of  the  eigenvalues  with  the  sources  is  performed  to  End  the  vectors 
that  minimize  the  distance  measure  between  the  training  set  of  interaural 
transfer  functions  and  the  estimated  interaural  transfer  functions  from  the 
received  data. 


4.5  Tracking  of  moving  source 

Continuous  tracking  is  required  in  applications  in  which  the  source  is  moving.  In 
such  a  case,  the  localization  system  needs  to  continuously  update  the  direction  of 
the  source.  In  subspace  methods,  the  intermediate  step  for  estimating  the  direction 
is  the  computation  of  signal  subspace.  As  the  direction  of  the  source  changes,  so 
does  the  signal  subspace.  An  approach  to  update  the  signal  subspace  is  to  apply  a 
forgetting  factor  0  <  /3  <  1  that  damps  out  the  effects  of  the  older  data  and  gives 
higher  weightage  to  more  recent  data. 

Rt(^n)  —  /3Rt~i(^n)  +  (1  ~ ■  P)Z^(ujn,  6t)Zt(un,  6t)  (4. 57) 

where  Zt(u>n,9t )  is  the  short-time  DFT  vector  computed  from  the  latest  data 
obtained  from  the  microphones,  and  t  is  the  running  time  index.  The  updated 
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matrix  Rt(uJn)  is  used  to  compute  the  new  signal  subspace  using  eigendecompo- 
sition  methods.  This  method,  however,  does  not  make  use  of  the  results  of  the 
previous  subspace  computations  for  Rt-i{ un)  and  is  highly  expensive  in  terms  of 
the  computational  cost. 

A  better  approach  is  to  use  the  data  matrix  for  subspace  computation  instead 
of  the  correlation  matrix.  The  data  matrix  is  formed  by  computing  Zt(un,6t)  at 
discrete  instants  of  time  t  and  stacking  them  in  a  form  of  matrix.  In  order  to  take 
into  account  the  forgetting  factor,  the  old  data  matrix  is  multiplied  by  forgetting 
factor  before  appending  a  column  of  newly  arrived  data. 

Dt{u}n)  =  /?A-i(w„):(l  -  (3)Zt(un,9t)  (4.58) 

It  can  be  easily  seen  that  Rt(u>n )  =  Df  (tvn) Dt(tvn) . 

We  then  follow  the  URV  Decomposition  method  introduced  by  Stewart  [10].  He 
showed  that  there  exists  a  matrix  decomposition,  called  the  URV  decomposition, 
of  Dt{uin)  which  is  of  the  form 

Df  K)  =  UtHAtVt  (4.59) 

where  Ut  G  CtxM  is  the  left  orthogonal  matrix,  At  G  CMxM  is  a  right  triangular 

matrix  and  Vt  G  CMxM  is  the  right  orthogonal  matrix.  The  matrix  At  can  be 
written  as 

s  r 

At  =  1  (4.60) 

o  r  2 

and  satisfies  the  following  properties. 

1.  S  and  T2  are  upper  triangular, 

2.  smaller  singular  value  of  S  «  ^/Xl, 
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3.  vIM2  +  ||r2||2  ~  v/Al+i  +  •  •  •  +  Am- 

where  Ai  >  •  •  •  Aj,  >  Xl+i  >  •  •  •  Am  represent  the  eigenvalues  of  Dt(ojn).  Un¬ 
der  these  conditions,  it  can  be  easily  proved  that  the  subspace  spanned  by  Vt 
is  equal  to  the  space  spanned  by  the  eigenvectors  of  Rt(un )  [10].  Moreover,  the 
subspace  spanned  by  the  first  L  columns  of  V,  is  approximately  equal  to  the  sub¬ 
space  spanned  by  the  L  eigenvectors  of  the  correlation  matrix  Rt{oon)  and  therefore 
represent  the  signal  subspace. 

Everytime  a  new  row  of  data  Z^+l{uin,  6t+i)  is  added  to  the  data  matrix,  the 
matrices  Ut,  A t  and  Vt  need  to  be  updated.  The  updating  of  the  signal  subspace 
does  not  require  the  knowledge  of  the  matrix  Ut.  So  in  the  computations  of  the 
signal  subspace,  Ut  is  completely  ignored.  This  results  in  huge  savings  in  the  com¬ 
putational  cost  and  the  storage  requirements.  The  updating  of  At  and  Vt  with 
the  arrival  of  a  new  row  is  an  0[n 2)  process.  It  is  to  be  compared  with  the  eigen¬ 
value  decomposition  of  the  correlation  matrix  and  the  singular  value  decomposition 
method  which  are  of  the  order  of  0(n3)  process.  Though,  these  methods  are  gen¬ 
erally  more  accurate  than  the  URV  decomposition  method,  Liu  et.  al  [9]  showed 
that  the  results  are  comparable.  For  more  details  on  the  computations  involved  in 
URV  decomposition,  please  refer  to  [10]  and  [9]. 
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Chapter  5 


Results  and  Discussion 


The  experimental  studies  were  conducted  to  evaluate  the  perfomance  of  the  lo¬ 
calization  algorithms  described  in  the  chapters.  The  following  two  set  ups  were 
considered. 

1.  The  effects  of  the  head  and  the  outer  ears  were  simulated  by  using  experi¬ 
mental  HRTF  measurements  from  KEMAR  manikin  [26] .  A  wide-band  noise 
source  was  used  as  the  sound  source  which  was  convoluted  with  the  KEMAR 
HRTFs  to  simulate  the  output  data  of  the  sensors.  Since  the  KEMAR  HRTF 
measurements  were  done  in  a  controlled  anechoic  environment,  the  simula¬ 
tions  in  this  case  follow  rather  ideal  conditions. 

2.  The  data  was  collected  from  the  Scout  robot  (Figure  2.2)  at  the  sampling 
rate  of  40  kHz.  A  wide  band  noise  was  generated  and  reproduced  from  a 
speaker  kept  at  around  3  meters  from  the  robot.  The  measurements  in  this 
case  were  done  in  a  highly- reverberant  room  environment. 
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5.1  Experimental  results  for  KEMAR 


A  wide-band  source  was  simulated  at  angle  20°.  The  sensor  measurement  errors 
were  introduced  by  zero  mean  normal  additive  noise  with  signal-to-noise  ratio 
(SNR)  of  20  dB.  In  the  training  mode,  however,  the  sensor  errors  were  assumed  to 
be  zero.  The  HRTF  provided  data  at  step-size  of  5°.  Therefore,  the  training  set 
consisted  of  72  ITF  vectors. 

For  temporal-correlation  based  (Lim-Duda),  spatial-correlation  based  (stereau- 
sis)  and  ESPRIT  methods,  we  considered  three  cases  where  ITD  only,  ILD  only 
and  both  ITD  and  ILD  cues  were  respectively  used  for  direction  estimation.  The 
histograms  of  the  estimated  directions  were  obtained.  The  results  are  shown  in 
Figure  5.1  thru  Figure  5.4. 

5.2  Experimental  results  for  Scout  robot 

A  wide-band  signal  was  produced  from  the  speaker  kept  at  direction  18°  from  the 
normal  of  the  line  connecting  the  two  microphones  on  the  robot  dummy  head.  The 
data  collected  from  the  microphones  was  used  for  estimating  the  directions.  For 
the  training  mode,  the  same  environment  was  used  and  the  data  was  collected  by 
rotating  the  robot  in  step-size  of  3°  at  angles  {0,  3,  6,  •  •  • ,  357}.  The  histogram  of 
the  estimated  direction  are  shown  in  Figure  5.5  thru  Figure  5.8. 

As  earlier,  for  temporal-correlation  based  (Lim-Duda),  spatial-correlation  based 
(stereausis)  and  ESPRIT  methods,  we  considered  three  cases  where  ITD  only,  ILD 
only  and  both  ITD  and  ILD  cues  were  respectively  used  for  direction  estimation. 
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5.3  Discussion 


The  results  indicate  that  the  estimates  of  the  subspace  based  methods  are  unbi¬ 
ased  compared  to  the  biological  methods.  The  subspace  methods  also  provided 
higher  percentage  of  accurate  estimates  and  therefore  lower  variance.  In  the  case 
of  KEMAR,  experiments  were  performed  with  decreasing  values  of  SNR.  It  was 
found  that  the  degradation  of  performance  for  the  biological  methods  was  higher 
than  that  of  subspace  methods.  In  fact,  at  SNR  of  6  dB,  the  biological  methods 
failed  to  localize  the  sources,  whereas  the  subspace  methods  provided  reasonably 
accurate  results. 

Among  the  biological  algorithms,  it  seems  that  the  temporal-correlation  meth¬ 
ods  are  better  suited  than  spatial-correlation  based  methods.  However,  it  is  pos¬ 
sible  that  the  simplified  method  of  extracting  ITD  used  in  this  thesis  failed  to 
capture  the  necessary  information  embedded  in  the  stereausis  image. 

Among  the  subspace  methods,  ESPRIT  estimates  showed  higher  variance  and 
lesser  percentage  accuracy  than  those  of  MUSIC.  This  points  to  the  fact  that 
the  projection  method  in  MUSIC  performed  better  than  the  least- mean-square 
method  used  in  ESPRIT  for  comparing  the  proximity  of  the  estimation  data  from 
the  training  set. 

It  is  interesting  to  note  that  the  cone  of  confusion  effect  (the  reason  for  peaks 
around  160°)  is  very  visible  in  the  KEMAR  case,  even  when  ILD  only  and  no  ITD  is 
used  for  estimation.  This  is  due  to  the  similarity  in  the  attenuation  characteristics 
of  the  front  part  and  the  rear  part  of  the  KEMAR  dummy  head.  In  the  case 
of  Scout  robot,  there  was  no  such  symmetry.  Therefore,  although  the  peaks  due 
to  cone  effect  were  present  in  the  ITD  only  case,  the  ILD  only  case  didn’t  show 
considerable  effect.  This  helped  in  achieving  better  localization  when  both  ITD 
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and  ILD  were  used. 


The  experiments  show  that  localization  accuracy  is  much  better  in  the  KEMAR 
case  than  the  Scout  robot  measurements.  This  can  be  easily  explained  by  the  fact 
that  in  the  latter  case  the  environment  is  highly  echoic.  The  echoes  can  be  viewed 
as  virtual  sources.  Such  an  environment  presents  multiple  weak  sources  which 
violates  the  assumptions  made  in  the  algorithms1.  Furthermore,  these  virtual 
sources  are  correlated  which  results  in  further  degradation  in  the  performance. 
Nevertheless,  the  results  show  that  good  localization  can  be  achieved,  especially 
by  subspace  methods,  even  in  an  extremely  reverberant  environment  by  using 
training/ estimation  approach. 


^The  subspace  algorithms  can  handle  multiple  sources.  However,  in  our  experiments,  the 
number  of  sensors,  M  =  2.  Therefore,  the  maximum  number  of  sources  that  subspace  algorithm 
can  localize  is  one. 
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Figure  5.1:  Histograms  of  temporal-correlation  method  for  KEMAR 
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Figure  5.2:  Histograms  of  spatial-correlation  method  for  KEMAR 
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Figure  5.3:  Histograms  of  MUSIC  method  for  KEMAR 
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Figure  5.4:  Histograms  of  ESPRIT  method  for  KEMAR 
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Figure  5.5: 


Histograms  of  temporal-correlation  method  for  Scout  robot 
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Figure  5.6:  Histograms  of  spatial-correlation  method  for  Scout  robot 
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Figure  5.7:  Histograms  of  MUSIC  method  for  Scout  robot 
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Figure  5.8:  Histograms  of  ESPRIT  method  for  Scout  robot 
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